library(tidyverse)
demographics = read.csv('data/demos_anonymized.csv')
ids = read.csv('data/ids_anonymized.csv')
model_variables = read.csv('data/model_variables_anonymized.csv')
It’s always a good idea to glance at the data
glimpse(demographics)
summary(demographics[,1:10])
A very common operation is to do things by group(s) then create new summary variables.
demographics %>%
group_by(libuser) %>%
summarise(age = mean(age, na.rm = T))
Do multiple operations at once
demographics %>%
group_by(libuser) %>%
summarise(age_mean = mean(age, na.rm = T),
age_sd = sd(age, na.rm = T),
age_max = max(age, na.rm = T),
prop_male = mean(gender=='Male', na.rm = T))
We’ll use the map function to map the sum function to each element in the list
x = list(1:3, 4:6, 7:9)
map(x, sum)
We can even do this with modeling and other operations.
This extracts the coefficients from a model run for each group.
demographics %>%
drop_na(race) %>%
group_split(race) %>%
map(~lm(award_total_amount ~ gender, data = .))
Visualizing data requires that you: - Consider carefully the information you want to display - And then how you want to display it
Tell a story with the data.
And have some fun with it!
The most widely used visualization package in R
Visualization can be thought of in a layered fashion
Start with the base, then build up
ggplot2 uses a + as a pipe to add layers
library(ggplot2)
ggplot(data) +
geom_point(aes(x = var1, y = var2)) +
geom_line() +
theme(plot.caption = element_text(size = 6))
We can pipe to any ggplot as we did before (%>%)
Aesthetics (aes) map variables to visual properties
Geoms are the geometric units we want to display
model_variables %>%
filter(award_total_amount > 1e7) %>%
ggplot() +
geom_density(aes(x = award_total_amount,
color = factor(gender),
fill = factor(gender)),
alpha = .2) +
scale_x_continuous(breaks = (1:10) * 1e7, trans = 'log')
demographics %>%
filter(award_year_start < 2020 & award_year_start > 1990) %>%
ggplot(aes(award_year_start, award_total_log)) +
geom_smooth(aes(color=factor(libuser)))
We can also use ggplot2 to create statistics we want to visualize
Typically used indirectly when geoms are called
Can be used for more direct control
ggplot(model_variables, aes(age, award_total_amount)) +
geom_point(alpha = .02) +
stat_ellipse(color = '#ff5500')
Scales are used to add specifications to axes, colors, etc.
model_variables %>%
filter(award_total_amount >= 1e6) %>%
ggplot() +
geom_density(aes(x = award_total_amount,
color = gender,
fill = gender),
alpha = .2) +
scale_x_continuous(breaks = c(1e6, 5e6, 1e7, 2.5e7, 5e7),
trans = 'log') +
scale_fill_viridis_d(begin = .25, end = .5) +
scale_color_viridis_d(begin = .25, end = .5)
Facets allow another dimension to plots by group
model_variables %>%
ggplot() +
geom_density(aes(x = award_total_amount,
color = libuser,
fill = libuser),
alpha = .2) +
facet_wrap(~gender)
Themes allow for customization
Two uses of a - a built-in versions (e.g. theme_minimal) - DIY (theme(…))
For the theme function, each argument, takes on a specific value or an element function:
element_rectelement_lineelement_textelement_blankmodel_variables %>%
ggplot() +
geom_smooth(aes(x = age,
y = award_total_amount,
color = libuser),
alpha = .2) +
theme_minimal()
model_variables %>%
ggplot() +
geom_smooth(aes(x = age,
y = award_total_amount,
color = libuser),
alpha = .2) +
theme(axis.text.x = element_text(size=12),
panel.grid.minor.x = element_blank(),
plot.background = element_rect(color = 'rosybrown'),
panel.background = element_rect(fill = 'papayawhip'))
Interactivity is a must-have tool for web-based presentation
Use to enhance exploration of the data - Not just because one can
Allows for additional dimensions
Even useful for exploring raw data
General
ggplot2 images to interactive onesSpecific functionality:
traces - add_, work similar to geoms
modes - allow for points, lines, text and combinations
aesthetics - variables are denoted with ~, constants do not use - x =~ var1 vs x = 2
Plotly uses the standard pipe %>%
library(plotly)
model_variables %>%
plot_ly(x = ~gender, y = ~ age) %>%
add_boxplot(color =~ gender)
library(plotly)
model_variables %>%
group_by(libuser, gender) %>%
summarise(award = mean(award_total_amount)) %>%
plot_ly(x = ~libuser,
y = ~award,
color = ~gender,
text = ~round(award),
textposition = 'auto',
type = 'bar') %>%
layout(bargap = 0.25,
bargroupgap = 0.25)
init = glm(award_total_amount >= 5000000 ~ age*libuser,
data = model_variables,
family = binomial)
model_variables %>%
modelr::add_predictions(init, type = 'response') %>%
plot_ly(x = ~age, y = ~ pred) %>%
add_lines(color =~ libuser, line = list(shape = "spline")) %>%
layout(title = 'Predicted Prob. Award > 1 mil')
Use ggplotly to turn our formerly static plots into interactive ones.
p = model_variables %>%
ggplot() +
geom_density(aes(x = log(award_total_amount),
color = libuser,
fill = libuser),
alpha = .2) +
facet_wrap(~ gender)
ggplotly()
Python has come a long way in terms of data processing. There is a misconception that it is faster and less memory-hungry than R, but this depends on many factors and is generally not true.
# note how when using something other than R, you have to specify the engine path
import pandas as pd
import numpy as np
demographics = pd.read_csv('data/demos_anonymized.csv')
ids = pd.read_csv('data/ids_anonymized.csv')
model_variables = pd.read_csv('data/model_variables_anonymized.csv')
demographics.describe(include = 'all')
demographics.describe(include = [np.number])
demographics.describe(include = [np.object])
lib_group = demographics.groupby('libuser', sort=True, )
# automatically chooses numeric
lib_group.mean()
lib_group.get_group(0).head()
lib_group.size()
lib_group.describe()
x = [[1,2,3], [4,5,6], [7,8,9]]
list(map(np.sum, x))
demographics.select_dtypes('number').apply(np.mean)
matplotlib is the most common visualization module in Python, though it’s fairly dated at this point. As such we’ll use a ggplot implementation in Python called plotnine.
Unfortunately for plotly, the interactivity makes it unusable within the R notebook (at present), so you may need to switch to Anaconda or other IDE to try other modules like plotly. Even Python users will still use R for easier visualization though, so feel free to do what you like there, then use ggplot etc. in R when the time comes.
That said, I’ll show a couple plots
import plotnine
dplot = demographics[(demographics.award_year_start < 2020) & (demographics.award_year_start > 1990) & (demographics.award_total_log > 11)]
dplot
ggplot(dplot, aes(x='award_year_start', y='award_total_log')) + geom_smooth(aes(group = 'libuser', color = 'libuser'))
year_award_average = demographics[demographics.libuser == 1].groupby(['award_home_dept', 'award_year_start']).mean()
year_award_average
Example boxplot with plotly.
import plotly
import plotly.plotly as py
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)
y0 = np.random.randn(50)-1
y1 = np.random.randn(50)+1
trace0 = go.Box(
y=y0
)
trace1 = go.Box(
y=y1
)
data = [trace0, trace1]
py.iplot(data)